Inspired by Julia Silge

The data for this analysis comes from #tidytuesday dataset on cocktail recipes. We’ll perform different unsupervised dimensionality reduction methods on different cocktail recipes. As mentioned in the source website, we need to pay attention to the measure column.

Exploratory Data Analysis

boston_cocktails <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv')
boston_cocktails %>% count(ingredient, sort = TRUE)
## # A tibble: 569 x 2
##    ingredient            n
##    <chr>             <int>
##  1 Gin                 176
##  2 Fresh lemon juice   138
##  3 Simple Syrup        115
##  4 Vodka               114
##  5 Light Rum           113
##  6 Dry Vermouth        107
##  7 Fresh Lime Juice    107
##  8 Triple Sec          107
##  9 Powdered Sugar       90
## 10 Grenadine            85
## # … with 559 more rows

Looking at the ingredient column, we may need to do some more data clean up. Some terms are in lower cases, upper cases, etc which make this column inconsistent. Furthermore, we will look at the measure column.

cocktails_parsed <- boston_cocktails %>%
  mutate(
    ingredient = str_to_lower(ingredient),
    ingredient = str_replace_all(ingredient, "-", " "),
    ingredient = str_remove(ingredient, " liqueur$"),
    ingredient = str_remove(ingredient, " (if desired)$"),
    ingredient = case_when(
      str_detect(ingredient, "bitters") ~ "bitters",
      str_detect(ingredient, "lemon") ~ "lemon juice",
      str_detect(ingredient, "lime") ~ "lime juice",
      str_detect(ingredient, "grapefruit") ~ "grapefruit juice",
      str_detect(ingredient, "orange") ~ "orange juice",
      TRUE ~ ingredient
    ),
    measure = case_when(
      str_detect(ingredient, "bitters") ~ str_replace(measure, "oz$", "dash"), # make assumption on bitters measurement, there may be some error on collecting this bitter measurement
      TRUE ~ measure
    ),
    measure = str_replace(measure, " ?1/2", ".5"),
    measure = str_replace(measure, " ?3/4", ".75"),
    measure = str_replace(measure, " ?1/4", ".25"),
    measure_number = parse_number(measure),
    measure_number = if_else(str_detect(measure, "dash$"),
      measure_number / 50,
      measure_number
    )
  ) %>%
  add_count(ingredient) %>%
  filter(n > 15) %>%
  select(-n) %>%
  distinct(row_id, ingredient, .keep_all = TRUE) %>%
  na.omit()

cocktails_parsed
## # A tibble: 2,542 x 7
##    name     category   row_id ingredient_numb… ingredient measure measure_number
##    <chr>    <chr>       <dbl>            <dbl> <chr>      <chr>            <dbl>
##  1 Gauguin  Cocktail …      1                1 light rum  2 oz              2   
##  2 Gauguin  Cocktail …      1                3 lemon jui… 1 oz              1   
##  3 Gauguin  Cocktail …      1                4 lime juice 1 oz              1   
##  4 Fort La… Cocktail …      2                1 light rum  1.5 oz            1.5 
##  5 Fort La… Cocktail …      2                2 sweet ver… .5 oz             0.5 
##  6 Fort La… Cocktail …      2                3 orange ju… .25 oz            0.25
##  7 Fort La… Cocktail …      2                4 lime juice .25 oz            0.25
##  8 Cuban C… Cocktail …      4                1 lime juice .5 oz             0.5 
##  9 Cuban C… Cocktail …      4                2 powdered … .5 oz             0.5 
## 10 Cuban C… Cocktail …      4                3 light rum  2 oz              2   
## # … with 2,532 more rows
cocktails_df <- cocktails_parsed %>%
  select(-ingredient_number, -row_id, -measure) %>%
  pivot_wider(names_from = ingredient, values_from = measure_number, values_fill = list(measure_number = 0)) %>%
  janitor::clean_names() %>%
  na.omit()

cocktails_df
## # A tibble: 937 x 42
##    name  category light_rum lemon_juice lime_juice sweet_vermouth orange_juice
##    <chr> <chr>        <dbl>       <dbl>      <dbl>          <dbl>        <dbl>
##  1 Gaug… Cocktai…      2           1          1               0           0   
##  2 Fort… Cocktai…      1.5         0          0.25            0.5         0.25
##  3 Cuba… Cocktai…      2           0          0.5             0           0   
##  4 Cool… Cocktai…      0           0          0               0           1   
##  5 John… Whiskies      0           1          0               0           0   
##  6 Cher… Cocktai…      1.25        0          0               0           0   
##  7 Casa… Cocktai…      2           0          1.5             0           0   
##  8 Cari… Cocktai…      0.5         0          0               0           0   
##  9 Ambe… Cordial…      0           0.25       0               0           0   
## 10 The … Whiskies      0           0.5        0               0           0   
## # … with 927 more rows, and 35 more variables: powdered_sugar <dbl>,
## #   dark_rum <dbl>, cranberry_juice <dbl>, pineapple_juice <dbl>,
## #   bourbon_whiskey <dbl>, simple_syrup <dbl>, cherry_flavored_brandy <dbl>,
## #   light_cream <dbl>, triple_sec <dbl>, maraschino <dbl>, amaretto <dbl>,
## #   grenadine <dbl>, apple_brandy <dbl>, brandy <dbl>, gin <dbl>,
## #   anisette <dbl>, dry_vermouth <dbl>, apricot_flavored_brandy <dbl>,
## #   bitters <dbl>, straight_rye_whiskey <dbl>, benedictine <dbl>,
## #   egg_white <dbl>, half_and_half <dbl>, vodka <dbl>, grapefruit_juice <dbl>,
## #   blended_scotch_whiskey <dbl>, port <dbl>, white_creme_de_cacao <dbl>,
## #   citrus_flavored_vodka <dbl>, whole_egg <dbl>, egg_yolk <dbl>,
## #   blended_whiskey <dbl>, dubonnet <dbl>, blanco_tequila <dbl>,
## #   old_mr_boston_dry_gin <dbl>

Principal Component Analysis

We can use the commonly used prcomp for PCA analysis, but we’ll follow Julia Silge’s approach which uses tidymodels. This is also easily extendable/usable for other models.

library(tidymodels)

pca_rec <- recipe(~ ., data = cocktails_df) %>% 
  update_role(name, category, new_role = "id") %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors())
pca_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          2
##  predictor         40
## 
## Operations:
## 
## Centering and scaling for all_predictors
## No PCA components were extracted.
pca_prep <- prep(pca_rec)
pca_prep
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          2
##  predictor         40
## 
## Training data contained 937 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for light_rum, lemon_juice, ... [trained]
## PCA extraction with light_rum, lemon_juice, ... [trained]
tidied_pca <- tidy(pca_prep, 2)

tidied_pca %>% 
  filter(component %in% paste0("PC", 1:5)) %>% 
  mutate(component = fct_inorder(component)) %>% 
  ggplot(aes(value, terms, fill = terms)) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~component, nrow = 1) + 
  labs(y =NULL)

- PC1: powdered sugar vs. simple syrup; recipes are not likely to have both. Let’s zoom in on the first four components, and understand which cocktail ingredients contribute in the positive and negative directions.

# need to reorder again, so it'll be sorted based on values
tidied_pca %>% 
  filter(component %in% paste0("PC", 1:4)) %>%
  group_by(component) %>% 
  top_n(8, abs(value)) %>% 
  ungroup() %>% 
  ggplot(aes(reorder(terms, abs(value)), abs(value), fill = value > 0)) +
  geom_bar(stat="identity") + 
  coord_flip() + 
  facet_wrap(~component, scales = "free_y") +
  labs(y = NULL, x = NULL, fill = "Positive?")

- PC1: powdered sugar + egg + gin drinks vs. simple syrup + lime + tequila drinks. This is the component that explains the most variation in drinks. - PC2: mostly about vermouth, both sweet and dry.

We can then observe more specifically PC1 and PC2.

juice(pca_prep) %>%
  ggplot(aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL)

- Fizzy, egg, powdered sugar drinks are to the left. - Simple syrup, lime, tequila drinks are to the right. - Vermouth drinks are more to the top.

Let’s look at PC2 and PC3.

juice(pca_prep) %>%
  ggplot(aes(PC2, PC3, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL)

It’s less obvious as compared to PC1 and PC2. How about PC1 and PC3?

juice(pca_prep) %>%
  ggplot(aes(PC1, PC3, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL)

Most of cocktail classics are found on the left. We can explore further on different combinations of principal components.

UMAP

library(embed)

umap_rec <- recipe(~., data = cocktails_df) %>%
  update_role(name, category, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors())

umap_prep <- prep(umap_rec)

umap_prep
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          2
##  predictor         40
## 
## Training data contained 937 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for light_rum, lemon_juice, ... [trained]
## UMAP embedding for light_rum, lemon_juice, ... [trained]

We can also do similar analysis just like what we had for PCA i.e. analysing the different principal components (UMAP components here).

juice(umap_prep) %>%
  ggplot(aes(umap_1, umap_2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL)

t-SNE

t-SNE is also another increasingly popular method to do dimensionality reduction. Inspired by Joshua Cook’s analysis, we’ll use Rtsne library to run t-SNE analysis.

library(Rtsne)
tsne_cocktails <- Rtsne(cocktails_df, perplexity = 5, check_duplicates = FALSE)

tsne_cocktails_tib <- as_tibble(tsne_cocktails$Y) %>%
    set_names(c("z1", "z2")) %>%
    mutate(category = cocktails_df$category)

Let’s visualize our t-SNE results

tsne_cocktails_summary <- tsne_cocktails_tib %>%
    group_by(category) %>%
    summarise(avg_z1 = mean(z1), avg_z2 = mean(z2)) %>%
    ungroup()

tsne_cocktails_tib %>%
    ggplot(aes(z1, z2)) +
    geom_point(aes(color = category)) +
    geom_label(aes(label = category, x = avg_z1, y = avg_z2),
               data = tsne_cocktails_summary,
               label.size = 0, 
               label.padding = unit(1, "mm"),
               fill = "white", 
               fontface = "bold",
               alpha = 0.8) +
    #scale_color_brewer(palette = "Set2") +
    theme(
        plot.title = element_text(hjust = 0.5),
        legend.position = "none"
    ) +
    labs(x = "Dim 1",
         y = "Dim 2",
         title = "PCA and t-SNE of drinks by ingredients")

We can observe some clear separate clusters, but some categories like Vodka should not be close to Non-alcoholic drinks. We can tune around the different parameters that t-SNE algorithm has such as its perplexity, but what we currently have is good enough for basic data exploratory analysis.